Combined quantized and continuous feature vector HMM approach to speech recognition

ABSTRACT

A device capable of achieving recognition at high accuracy and with fewer calculations, utilizing an HMM. The present device has a vector quantizing circuit for generating a model by quantizing vectors of a training pattern having a vector series and converting the vectors into a label series of the clusters to which they belong, a continuous distribution probability density HMM generating circuit for generating a continuous distribution probability density HMM from the quantized vector series corresponding to each label of the label series, and a label incidence calculating circuit for calculating the incidence of the labels in each state from the training vectors classified in the same clusters and the continuous distribution probability density HMM.

This is a continuation of application Ser. No. 08/076,192 filed on Jun. 14, 1993, now abandoned.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an HMM generator, HMM memory device, likelihood calculating device and recognizing device for a novel HMM (Hidden Markov Model) that is applicable to pattern recognition tasks such as speech recognition.

2. Related Art of the Invention

Although an HMM is generally applicable to the time-series signal processing field, for convenience of explanation it is described below in relation to, for example, speech recognition. A speech recognizing device using HMM is described first. FIG. 1 is a block diagram of a speech recognizing device using HMM. A speech analyzing part 201 converts input sound signals to feature vectors at a constant time interval (called a frame) of, for example, 10 msec, by means of a conventional method such as a filter bank, Fourier transformation or LPC analysis. Thus, the input signals are converted to a feature vector series Y=(y(1), y(2), . . . , y(T)), where T is the number of frames. A codebook 202 holds labeled representative vectors. A vector quantizing part 203 substitutes, for each vector of the vector series Y, the label of the closest representative vector registered in the codebook 202. An HMM generating part 204 generates, from training data, an HMM corresponding to each word that constitutes the recognition vocabulary. In other words, to generate an HMM corresponding to a word v, an HMM structure (the number of states and the transition rules applicable between the states) is first appropriately designated, and the state transition probabilities in the model and the incidence probabilities of labels occurring in accordance with the state transitions are derived from the label series obtained from a multiplicity of vocalizations of the word v, such that the incidence probability of the label series is maximized. An HMM memory part 205 stores the HMMs thus obtained for the words. A likelihood calculating part 206 calculates the likelihood of the respective models stored in the HMM memory part 205 to the label series of an unknown input. A comparison and determination part 207 determines, as the recognition result, the word corresponding to the model that provides the highest likelihood among the respective models.
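For orientation, the flow of FIG. 1 can be summarized in a short sketch. This is a minimal illustration rather than the device itself: the feature extractor is assumed to have run already, and `codebook` and the per-word log-likelihood functions in `word_models` are assumed given (all names here are illustrative, not the patent's).

```python
import numpy as np

def recognize(frames, codebook, word_models):
    """Sketch of the FIG. 1 pipeline: frames -> label series -> best word.

    frames:      (T, d) array of feature vectors y(1)..y(T)
    codebook:    (M, d) array of labeled representative vectors
    word_models: dict mapping word v -> log-likelihood function of a label series
    """
    # Vector quantizing part 203: replace each frame by the label of
    # the closest representative vector registered in the codebook.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)          # label series O = o(1)..o(T)
    # Likelihood part 206 / determination part 207: argmax over the words.
    scores = {v: loglik(labels) for v, loglik in word_models.items()}
    return max(scores, key=scores.get)
```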

More specifically, the recognition by HMM is performed in the following manner. When the label series O obtained for an unknown input is assumed to be O=(o(1), o(2), . . . , o(T)), the model corresponding to a word v is λ^(v), and a given state series of length T generated by the model λ^(v) is X=(x(1), x(2), . . . , x(T)), the likelihood of λ^(v) to the label series O is defined as:

[Exact Solution] $L_{1}(v) = \sum_{X} P(O, X \mid \lambda^{v}) \qquad [\text{formula 1}]$

[Approximate Solution] $L_{2}(v) = \max_{X}\left[ P(O, X \mid \lambda^{v}) \right] \qquad [\text{formula 2}]$

or logarithmically as: $L_{3}(v) = \max_{X}\left[ \log P(O, X \mid \lambda^{v}) \right] \qquad [\text{formula 3}]$

where P(O, X|λ^(v)) is the joint occurrence probability of O and X in the model λ^(v).

Therefore, using formula 1 for example, in the following expression: $\hat{v} = \underset{v}{\operatorname{argmax}}\left[ L_{1}(v) \right] \qquad [\text{formula 4}]$

v̂ is the recognition result. Formulae 2 and 3 can be used in the same manner.

P(O, X|λ) can be obtained in the following manner.

When the incidence b_(i)(o) of a label o in state q_(i) and the transition probability a_(ij) from state q_(i) (i=1˜I) to state q_(j) (j=1˜I+1) are given for each state q_(i) (i=1˜I) of the HMM, the joint probability of the state series X=(x(1), x(2), . . . , x(T)) and the label series O=(o(1), o(2), . . . , o(T)) occurring from HMM λ is defined as: $P(O, X \mid \lambda) = \pi_{x(1)} \prod_{t=1}^{T} a_{x(t)\,x(t+1)} \prod_{t=1}^{T} b_{x(t)}(o(t)) \qquad [\text{formula 5}]$

where π_(x(1)) is the initial probability of state x(1). Incidentally, x(T+1)=I+1 is the final state, and it is assumed that no label is generated there.
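As a concrete reading of formula 5, the probability of one particular state series is a product of one initial term, T transition terms and T incidence terms. A minimal sketch in the log domain (to avoid underflow), assuming `log_pi`, `log_A` and `log_B` are precomputed tables of log π_i, log a_ij and log b_i(o); these names are illustrative:

```python
import math

def log_joint(labels, states, log_pi, log_A, log_B):
    """log P(O, X | lambda) per formula 5, for one given state series X.

    labels: o(1)..o(T); states: x(1)..x(T+1), where x(T+1) is the final state
    log_pi[i], log_A[i][j], log_B[i][o]: log initial / transition / incidence
    """
    T = len(labels)
    lp = log_pi[states[0]]                       # pi_{x(1)}
    for t in range(T):
        lp += log_A[states[t]][states[t + 1]]    # a_{x(t) x(t+1)}
        lp += log_B[states[t]][labels[t]]        # b_{x(t)}(o(t))
    return lp
```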

In the example above, an input feature vector y is converted to a label, but the feature vector y can alternatively be used directly; in that case, a probability density function of the vector y is given for each state, and the probability density b_(i)(y) of the feature vector y is used in place of the incidence probability b_(i)(o) of the label o in the state q_(i). Hereinafter, when z is a label, b_(i)(z) denotes the probability of z occurring in state i, and when z is a vector, b_(i)(z) denotes the probability density of z. In this case, the formulae 1, 2 and 3 are expressed as:

[Exact Solution] $L_{1}^{\prime}(v) = \sum_{X} P(Y, X \mid \lambda^{v}) \qquad [\text{formula 6}]$

[Approximate Solution] $L_{2}^{\prime}(v) = \max_{X}\left[ P(Y, X \mid \lambda^{v}) \right] \qquad [\text{formula 7}]$

or logarithmically as: $L_{3}^{\prime}(v) = \max_{X}\left[ \log P(Y, X \mid \lambda^{v}) \right] \qquad [\text{formula 8}]$

Thus, in any of the methods, when an HMM λ^(v) is prepared for each word v, where v=1˜V, the final recognition result for an input sound signal Y is: $\hat{v} = \underset{v}{\operatorname{argmax}}\left[ P(Y \mid \lambda^{v}) \right] \qquad [\text{formula 9}]$

or $\hat{v} = \underset{v}{\operatorname{argmax}}\left[ \log P(Y \mid \lambda^{v}) \right] \qquad [\text{formula 10}]$

where Y is, of course, the input label series or feature vector series, according to the respective method.

In such conventional examples, the method of converting input feature vectors to labels is hereinafter referred to as the discrete probability distribution HMM, and the method of using input feature vectors as they are as the continuous probability distribution HMM. Features of these are described below.

It is an advantage of the discrete probability distribution HMM that the number of calculations is small when calculating the likelihood of a model to an input label series, because the incidence probability b_(i)(C_(m)) of a label in state i can be obtained simply by reading it from a memory device which prestores the incidence probabilities in relation to the labels; however, recognition accuracy is inferior, which creates a problem, due to errors associated with quantization. In order to prevent this problem, it is necessary to increase the number of labels (the number of clusters), although the number of learning patterns required for learning the models accordingly becomes significant. If the number of learning patterns is insufficient, b_(i)(C_(m)) may frequently be 0, and a correct estimation cannot be obtained. For example, the following case may occur.

In the preparation of a codebook, speeches vocalized by multiple speakers are converted to feature vector series for all words to be recognized, the set of feature vectors is clustered, and the clusters are respectively labeled. Each cluster has a representative vector called a centroid, which is generally the expected value of the vectors classified to the cluster. A codebook is defined as the centroids stored in a form retrievable by the labels.
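Such a codebook can be produced with any standard clustering routine; the embodiment later names the LBG method, but as an illustration a plain k-means sketch suffices (the function and variable names are assumptions, not the patent's):

```python
import numpy as np

def build_codebook(vectors, M, iters=20, seed=0):
    """Cluster training feature vectors into M clusters; return centroids.

    The centroid of cluster C_m (the mean of its members) is the labeled
    representative vector retrievable by label m. Plain k-means stands in
    here for the LBG method mentioned in the text.
    """
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), M, replace=False)]
    for _ in range(iters):
        # assign each vector the label of its nearest centroid
        d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its cluster members
        for m in range(M):
            members = vectors[labels == m]
            if len(members):
                centroids[m] = members.mean(axis=0)
    return centroids, labels
```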

Now, it is assumed that a word “Osaka”, for example, is present in the recognition vocabulary, and a model corresponding to it is prepared. Voice samples corresponding to the word “Osaka” that are vocalized by multiple speakers are converted to feature vector series, each of the feature vectors is compared with the centroids, and the label of the closest centroid is chosen as the vector-quantized value of the feature vector. In this way, the voice samples corresponding to the word “Osaka” are converted to label series. By estimating the HMM parameters from the resultant label series in such a manner that the likelihood to the label series is maximized, a model corresponding to the word “Osaka” is obtained. For the estimation, the method known as the Baum-Welch algorithm can be used.

In this case, some of the labels in the codebook might not be included in the learning label series corresponding to the word “Osaka”. The incidence probability of such labels is estimated to be “0” during the learning process. It is nevertheless very likely that labels not included in the label series used for modeling the word “Osaka” will be present in a label series to which a vocalization of the word “Osaka” is converted at recognition time. In such a case, the incidence probability of the label series of the word “Osaka” vocalized at recognition, given the model of the word “Osaka”, comes to be “0”. Even in such a case, however, the vocalization may be different at the label level yet, in the feature vector state before conversion to labels, relatively close to a voice sample used in model learning, and sufficient to be recognized as “Osaka” at the vector level. Such a problem arises because, even though originally the same word is vocalized and the vocalizations are similar at the vector level, they can be converted to absolutely different labels because of slight differences, and it is easily predicted that this adversely affects recognition accuracy. As the number of clusters is increased and the amount of training data is decreased, this problem occurs more frequently.

In order to eliminate this problem, smoothing and interpolation are required for labels that do not appear in the training set. Various methods have been suggested for the smoothing and interpolation: reducing the number of parameters by using a concept called “tied states”, substituting a small value for a probability estimated to be 0 instead of leaving it at 0, and blurring the boundaries of clusters as in fuzzy vector quantization; however, none of them fundamentally solves the problem. In addition, such methods involve elements that must be empirically determined for particular cases, and no theoretical guideline for determining them has been suggested.

On the other hand, in the continuous probability distribution HMM, a distribution profile for each state is given beforehand in the form of a function such as a normal distribution, and the parameters defining the function are estimated from the learning data. Therefore, the number of parameters to be estimated is smaller, a parameter can be estimated accurately from fewer learning patterns than in the discrete type, smoothing and interpolation are unnecessary, and it is reported that a recognition ratio higher than that of the discrete type is generally obtained.

For example, when the number of parameters is compared between the discrete and continuous types for an HMM with a 4-state, 3-loop arrangement as shown in FIG. 4, the following result is obtained. In the case of the discrete type, if 256 types of labels are used, 256×3=768 parameters for the incidence probabilities of labels and 6 for the transition probabilities result; thus 774 in total are required for one model. In the case of the continuous type, if a 10-dimensional normal distribution is used, 10×3=30 parameters for the average vectors, 55×3=165 for the variance-covariance matrices (because each is a symmetric matrix) and 6 for the transition probabilities result; thus 201 in total are required. Therefore the number of parameters to be estimated in the continuous type is approximately 1/4 or less of that in the discrete type.
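The counts follow directly once one notes that a symmetric d×d variance-covariance matrix has d(d+1)/2 free entries:

$\text{discrete: } 3 \times 256 + 6 = 774, \qquad \text{continuous: } 3 \times \left(10 + \tfrac{10 \cdot 11}{2}\right) + 6 = 3 \times (10 + 55) + 6 = 201, \qquad \tfrac{201}{774} \approx \tfrac{1}{4}$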

However, the number of calculations is significantly increased in the continuous type in comparison with the discrete type; although the continuous type is superior in recognition accuracy, a problem is thus still created. In other words, if an input feature vector y(t) follows, in state i, a normal distribution with average vector μ_(i) and variance-covariance matrix Σ_(i), calculating the incidence probability (density) of y(t) in state i requires the computation of (y(t)−μ_(i))^(T)Σ_(i)⁻¹(y(t)−μ_(i)). Thus, in the case of a 10-dimensional continuous type HMM, for example, 110 multiplications are required for this computation alone, and it is repeated (the number of states)×(the number of input frames) times for one model. Therefore, when the number of input frames is estimated to be 50, the number of multiplications required for the computation of (y(t)−μ_(i))^(T)Σ_(i)⁻¹(y(t)−μ_(i)) is 110×3×50=16,500. Then, if the number of words is 500, this is multiplied by 500; that is, 8,250,000 multiplications are required for this portion alone.
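The figure of 110 multiplications per state and frame can be verified by writing the quadratic form out with explicit loops: d² multiplications for Σ_i⁻¹(y−μ_i), plus d for the final dot product with (y−μ_i). A small illustrative sketch, not part of the device:

```python
import numpy as np

def quad_form_mult_count(y, mu, sigma_inv):
    """Evaluate (y - mu)^T Sigma^{-1} (y - mu) with explicit loops,
    counting multiplications: d*d for Sigma^{-1}(y - mu), plus d for
    the outer dot product, i.e. d*(d+1) = 110 when d = 10."""
    d = len(y)
    diff = y - mu
    tmp = np.zeros(d)
    mults = 0
    for i in range(d):
        for j in range(d):
            tmp[i] += sigma_inv[i, j] * diff[j]
            mults += 1
    val = 0.0
    for i in range(d):
        val += diff[i] * tmp[i]
        mults += 1
    return val, mults   # mults == 110 for d == 10
```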

In the case of the discrete type, after the calculations for quantizing the vectors are completed, it is only required to read the incidence probability of each label from the memory device, as described. In the above example, the calculations required for vector quantization of y(t) are those of the distance or similarity between the 256 representative vectors and y(t). If the distance is the squared Euclidean distance, the calculations required for the labeling of y(t) are 256 times 10 subtractions, 10 multiplications and 10 additions. Therefore, in the case of 50 frames, for multiplication only, 10×256×50=128,000 calculations must be performed. If the vector quantization is performed according to a method called binary search, the figure 256 is replaced by 2 log₂256=16, and the number of calculations comes to 10×16×50=8,000.

Accordingly, the number of calculations is remarkably reduced in the discrete type. Moreover, while in the continuous type the number of calculations increases in proportion to the number of recognition words, in the discrete type such calculations are required only once, for vector quantization of the input sound signals, and the number of calculations is unchanged even when the number of recognition words is increased.

In summary, the discrete type has a problem in recognition accuracy although the number of calculations is small, while the continuous type has a problem in the number of calculations although its recognition accuracy is superior.

SUMMARY OF THE INVENTION

Hence, in light of such problems in the conventional HMM, it is an object of the present invention to provide an HMM generator, HMM memory device, likelihood calculating device and recognizing device capable of providing superior recognition accuracy while reducing the number of calculations.

An HMM generator of the present invention comprises:

vector quantizing means for quantizing vectors of a training pattern having a vector series, and converting the vectors into a label series of the clusters to which they belong,

continuous distribution probability density HMM generating means for generating a continuous distribution probability density HMM from a quantized vector series corresponding to each label of the label series, and

label incidence calculating means for calculating the incidence of the labels in each state from the training vectors classified in the same clusters and the continuous distribution probability density HMM.

A likelihood calculating device of the present invention comprises:

the above-mentioned vector quantizing means for converting the vector series to a label series by substituting labels for the vectors of a feature vector series that constitutes an input pattern, and

likelihood calculating means for calculating, from the state transition probabilities and label incidences stored in an HMM memory device, the likelihood of the HMM described by the parameters stored in the HMM memory device to the input pattern.

In an HMM generator according to the present invention, the vector quantizing means quantizes the vectors of a training pattern having a vector series and converts the vectors to a label series of the clusters to which they belong, the continuous distribution probability density HMM generating means generates a continuous distribution probability density HMM from the quantized vector series corresponding to each label of the label series, and the incidence of each label in each state is calculated from the training vectors classified in the same clusters and the continuous distribution probability density HMM.

Additionally, in a likelihood calculating device according to the present invention, a vector series is converted to a label series by substituting labels for the vectors of a feature vector series constituting an input pattern by the vector quantizing means, and the likelihood of the HMM to the input pattern is calculated from the label incidence in each state of the HMM generated by the HMM generator.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram explaining a speech recognizing device using HMM of the prior art.

FIG. 2 is a part of a block diagram showing an embodiment of a device for estimating a parameter of an HMM according to the present invention.

FIG. 3 is the remaining part of the block diagram showing an embodiment of a device for estimating a parameter of an HMM according to the present invention.

FIG. 4 is a structural diagram of an HMM showing a structure of a continuous probability distribution type HMM.

FIG. 5 is a block diagram showing an embodiment of a speech recognizing device using HMM that is constructed according to the present invention.

PREFERRED EMBODIMENT OF THE INVENTION

An embodiment of the present invention is described below by referring to the drawings.

The definitions of the symbols used hereinbelow are first described. For simplicity, states q_(i), q_(j) and the like are indicated simply as i, j and the like where no misunderstanding arises. As for model learning, it is described for a word v, and a superscript v is added, when identification is required, in the upper right portion of a parameter, but omitted in general circumstances. The symbols are described below.

i=1, 2, . . . , I+1: State number i

[a_(ij)]: Transition matrix

a_(ij): Transition probability from state i to state j

r: Training pattern number (r=1, . . . , R) for the word v

y^((r))(t): Observation vector at the frame number t of the training pattern r

o^((r))(t): Observation label at the frame number t of the training pattern r

b_(i)(y^((r))(t)): Probability density in the state i of the vector y^((r))(t) of the frame t of the training pattern r

b_(i)(o^((r))(t)): Incidence (probability, probability density etc.) in the state i of the observation label o^((r))(t) of the frame t of the training pattern r

Y^((r))=(y^((r))(1), y^((r))(2), . . . , y^((r))(T^((r)))): Vector series of the training pattern r (where r=1, 2, . . . , R)

O^((r))=(o^((r))(1), o^((r))(2), . . . , o^((r))(T^((r)))): r-th label series for the word v (where r=1, 2, . . . , R)

X^((r))=(x^((r))(1), x^((r))(2), . . . , x^((r))(T^((r))), x^((r))(T^((r))+1)): State series corresponding to Y^((r)) or O^((r))

x^((r))(t): State of the r-th training pattern at the frame number t for the word v

T^((r)): The number of frames of the r-th training pattern for the word v

μ_(i): Average vector of b_(i)(y)

Σ_(i): Variance-covariance matrix of b_(i)(y)

ξ_(i): Set of parameters (ξ_(i)={μ_(i), Σ_(i)}) defining a probability distribution of the observation vector in the state i

λ_(i)=[ξ_(i), {a_(ij)}_(j=1, . . . , I+1)]: Set of parameters in the state i

λ={λ_(i)}: Set of all parameters (a model having the parameter set λ is also referred to as “model λ”)

P(Y|λ): Probability density of the observation vector series Y occurring from the model λ

P(O|λ): Probability of the observation label series O occurring from the model λ

π_(i): Probability of the state i at t=1

First of all, a method of learning the continuous probability distribution HMM for a word v is described. The problem is to estimate the parameter λ that maximizes the likelihood function P(Y⁽¹⁾, Y⁽²⁾, . . . , Y^((R))|λ) for the training patterns r=1˜R prepared for the word v.

If the Y^((r)) are mutually independent, this is given by: $P(Y^{(1)}, Y^{(2)}, \ldots, Y^{(R)} \mid \lambda) = \prod_{r=1}^{R} P(Y^{(r)} \mid \lambda) = \prod_{r=1}^{R}\left\{ \sum_{X^{(r)}} P(Y^{(r)}, X^{(r)} \mid \lambda) \right\} = \sum_{X^{(1)}} \cdots \sum_{X^{(R)}} \prod_{k=1}^{R} P(Y^{(k)}, X^{(k)} \mid \lambda) \qquad [\text{formula 11}]$

Here, an auxiliary function Q(λ, λ′) is defined: $Q(\lambda, \lambda^{\prime}) = \sum_{X^{(1)}} \cdots \sum_{X^{(R)}} \left[ \prod_{k=1}^{R} P(Y^{(k)}, X^{(k)} \mid \lambda) \right] \times \log\left[ \prod_{k=1}^{R} P(Y^{(k)}, X^{(k)} \mid \lambda^{\prime}) \right] \qquad [\text{formula 12}]$

For this, the following relation is obtained: if Q(λ, λ′)≧Q(λ, λ), then P(Y⁽¹⁾, . . . , Y^((R))|λ′)≧P(Y⁽¹⁾, . . . , Y^((R))|λ), and equality is attained when λ′=λ. Hence, set: $\lambda^{*} = \underset{\lambda^{\prime}}{\operatorname{argmax}}\left[ Q(\lambda, \lambda^{\prime}) \right] \qquad [\text{formula 13}]$

Therefore, if formula 13 is solved, by repeatedly applying formula 13 with λ*→λ, λ converges to a stationary point of P(Y⁽¹⁾, . . . , Y^((R))|λ), that is, a point giving a maximum value or saddle point of P(Y⁽¹⁾, . . . , Y^((R))|λ), and a local optimum solution can then be obtained by repeating the operation until the rate of change of P(Y⁽¹⁾, . . . , Y^((R))|λ) becomes equal to or less than a predetermined threshold value.

Next, a method of estimating a parameter by using Q(λ, λ′) is described.

By transforming formula 12, the following formula is obtained: $Q(\lambda, \lambda^{\prime}) = P(Y^{(1)}, \ldots, Y^{(R)} \mid \lambda) \sum_{r=1}^{R} \frac{1}{P(Y^{(r)} \mid \lambda)} \sum_{X^{(r)}} P(Y^{(r)}, X^{(r)} \mid \lambda) \log P(Y^{(r)}, X^{(r)} \mid \lambda^{\prime}) \qquad [\text{formula 14}]$

From the above description, a λ′ that provides Q(λ, λ′)>Q(λ, λ), with Q(λ, λ′) regarded as a function of λ′, serves as an update of λ. Since P(Y⁽¹⁾, . . . , Y^((R))|λ) is a constant value with respect to λ′, it can be removed as: $Q^{\prime}(\lambda, \lambda^{\prime}) = Q(\lambda, \lambda^{\prime}) / P(Y^{(1)}, \ldots, Y^{(R)} \mid \lambda) = \sum_{r=1}^{R} c^{(r)} \sum_{X^{(r)}} P(Y^{(r)}, X^{(r)} \mid \lambda) \log P(Y^{(r)}, X^{(r)} \mid \lambda^{\prime}) \qquad [\text{formula 30}]$

so that finding such a λ′ corresponds to finding λ′ that provides Q′(λ, λ′)>Q′(λ, λ). Here, the following quantities are defined: $\xi_{ij}^{(r)}(t) = P(Y^{(r)}, x^{(r)}(t-1)=i, x^{(r)}(t)=j \mid \lambda)$, $\gamma_{i}^{(r)}(t) = P(Y^{(r)}, x^{(r)}(t)=i \mid \lambda) = \sum_{j=1}^{I+1} \xi_{ij}^{(r)}(t)$, $c^{(r)} = 1/P(Y^{(r)} \mid \lambda) = 1/\sum_{i=1}^{I} \gamma_{i}^{(r)}(t) \qquad [\text{formula 15}]$

Formula 14 can be further transformed as: $Q^{\prime}(\lambda, \lambda^{\prime}) = \sum_{r=1}^{R} c^{(r)} \sum_{X^{(r)}} P(Y^{(r)}, X^{(r)} \mid \lambda) \left\{ \log \pi^{\prime}_{x^{(r)}(1)} + \sum_{t} \log a^{\prime}_{x^{(r)}(t)\,x^{(r)}(t+1)} + \sum_{t} \log b^{\prime}_{x^{(r)}(t)}(y^{(r)}(t)) \right\} = \sum_{r=1}^{R} c^{(r)} \sum_{i} P(Y^{(r)}, x^{(r)}(1)=q_{i} \mid \lambda) \log \pi^{\prime}_{i} + \sum_{r=1}^{R} c^{(r)} \sum_{t} \sum_{i} \sum_{j} P(Y^{(r)}, x^{(r)}(t)=q_{i}, x^{(r)}(t+1)=q_{j} \mid \lambda) \log a^{\prime}_{ij} + \sum_{r=1}^{R} c^{(r)} \sum_{t} \sum_{i} P(Y^{(r)}, x^{(r)}(t)=q_{i} \mid \lambda) \log b^{\prime}_{i}(y^{(r)}(t)) \qquad [\text{formula 16}]$

When this is maximized with respect to π_(i)′ using the first term on the right side, the re-estimated value π_(i)* of π_(i) is: $\pi_{i}^{*} = \sum_{r=1}^{R} c^{(r)} \gamma_{i}^{(r)}(1) \qquad [\text{formula 17}]$

When it is maximized with respect to a_(ij)′ using the second term on the right side, the re-estimated value a_(ij)* of a_(ij) is: $a_{ij}^{*} = \frac{\sum_{r=1}^{R} c^{(r)} \sum_{t=1}^{T^{(r)}} \xi_{ij}^{(r)}(t)}{\sum_{r=1}^{R} c^{(r)} \sum_{t=1}^{T^{(r)}} \sum_{j=1}^{I+1} \xi_{ij}^{(r)}(t)} \qquad [\text{formula 18}]$

When it is maximized with respect to μ_(i)′ and Σ_(i)′ using the third term on the right side, the re-estimated values μ_(i)* and Σ_(i)* of μ_(i) and Σ_(i), respectively, are: $\mu_{i}^{*} = \frac{\sum_{r=1}^{R} c^{(r)} \sum_{t=1}^{T^{(r)}} \gamma_{i}^{(r)}(t)\, y^{(r)}(t)}{\sum_{r=1}^{R} c^{(r)} \sum_{t=1}^{T^{(r)}} \gamma_{i}^{(r)}(t)} \qquad [\text{formula 19}]$

$\Sigma_{i}^{*} = \frac{\sum_{r=1}^{R} c^{(r)} \sum_{t=1}^{T^{(r)}} \gamma_{i}^{(r)}(t)\, (y^{(r)}(t) - \mu_{i})(y^{(r)}(t) - \mu_{i})^{T}}{\sum_{r=1}^{R} c^{(r)} \sum_{t=1}^{T^{(r)}} \gamma_{i}^{(r)}(t)} \qquad [\text{formula 20}]$

Here, ξ^((r))_(ij)(t) can be calculated as follows. That is, if the quantities: $\alpha_{i}^{(r)}(t) = P(y^{(r)}(1), \ldots, y^{(r)}(t), x^{(r)}(t)=i \mid \lambda)$, $\beta_{i}^{(r)}(t) = P(y^{(r)}(t+1), \ldots, y^{(r)}(T^{(r)}) \mid x^{(r)}(t)=i, \lambda) \qquad [\text{formula 21}]$

are introduced, then: $\xi_{ij}^{(r)}(t) = P(Y^{(r)}, x^{(r)}(t-1)=i, x^{(r)}(t)=j \mid \lambda) = \alpha_{i}^{(r)}(t-1)\, a_{ij}\, b_{j}(y^{(r)}(t))\, \beta_{j}^{(r)}(t)$, $\gamma_{i}^{(r)}(t) = \alpha_{i}^{(r)}(t)\, \beta_{i}^{(r)}(t) \qquad [\text{formula 22}]$

Thus, the following recurrence formulae are obtained: $\alpha_{j}^{(r)}(t) = \sum_{i} \alpha_{i}^{(r)}(t-1)\, a_{ij}\, b_{j}(y^{(r)}(t)) \qquad [\text{formula 23}]$

$\beta_{i}^{(r)}(t) = \sum_{j} a_{ij}\, b_{j}(y^{(r)}(t+1))\, \beta_{j}^{(r)}(t+1) \qquad [\text{formula 24}]$

Therefore, by giving an appropriate initial value to λ, setting α^((r))_(1)(1)=1, and successively calculating α^((r))_(j)(t) for t=1˜T^((r))+1, j=1˜I+1 according to formula 23, and then β^((r))_(i)(t) for t=T^((r))+1˜1, i=I˜1 according to formula 24, setting β^((r))_(I+1)(T^((r))+1)=1, formula 15 can be calculated.
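The recursions of formulae 23 and 24, together with formula 22 and the scaling factor c^(r), can be sketched compactly for one training pattern. The sketch below assumes the emission densities b_i(y^((r))(t)) have already been evaluated into a matrix, works with unscaled probabilities (so it is only usable for short patterns), and omits the non-emitting final state I+1 for brevity; all names are illustrative:

```python
import numpy as np

def forward_backward(B, pi, A):
    """alpha/beta recursions (formulae 23, 24) and xi/gamma (formula 22).

    B:  (T, I) matrix with B[t, i] = b_i(y(t)), emission densities per frame
    pi: (I,) initial probabilities; A: (I, I) transition probabilities
    """
    T, I = B.shape
    alpha = np.zeros((T, I))
    beta = np.zeros((T, I))
    alpha[0] = pi * B[0]                       # alpha_i(1)
    for t in range(1, T):                      # formula 23
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    beta[T - 1] = 1.0                          # terminal condition
    for t in range(T - 2, -1, -1):             # formula 24
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    gamma = alpha * beta                       # gamma_i(t), unnormalized
    # xi[t, i, j] corresponds to xi_ij(t+1) in the patent's indexing
    xi = alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :]
    c = 1.0 / alpha[T - 1].sum()               # c^(r) = 1 / P(Y | lambda)
    return alpha, beta, gamma, xi, c
```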

The actual procedure for estimating a parameter comprises the steps below (a code sketch of the loop follows the steps):

(1) L₁=∞

(2) giving an appropriate initial value to λ_(i)={{a_(ij)}_(j=1, . . . , I+1), μ_(i), Σ_(i)} for i=1˜I;

(3) calculating α^((r))_(i)(t) for r=1˜R, t=2˜T^((r)), i=1˜I+1 according to formula 23, using λ={λ_(i)};

(4) calculating β^((r))_(i)(t) and ξ^((r))_(ij)(t) for r=1˜R, t=2˜T^((r)), i=1˜I+1 according to formulae 24 and 22, respectively, using λ={λ_(i)};

(5) calculating, for r=1˜R and i, j=1˜I+1:

the numerators a_(ij,num)(r), μ_(i,num)(r), Σ_(i,num)(r), and

the common denominator Den_(i)(r)=a_(ij,denom)(r)=μ_(i,denom)(r)=Σ_(i,denom)(r) of the formulae 18, 19 and 20;

(6) calculating the re-estimated values a_(ij)*, μ_(i)* and Σ_(i)* of a_(ij), μ_(i) and Σ_(i) according to the following formulae: $a_{ij}^{*} = \sum_{r} c^{(r)} a_{ij,num}(r) \Big/ \sum_{r} c^{(r)} Den_{i}(r), \quad \mu_{i}^{*} = \sum_{r} c^{(r)} \mu_{i,num}(r) \Big/ \sum_{r} c^{(r)} Den_{i}(r), \quad \Sigma_{i}^{*} = \sum_{r} c^{(r)} \Sigma_{i,num}(r) \Big/ \sum_{r} c^{(r)} Den_{i}(r) \qquad [\text{formula 25}]$

(7) by the substitutions a_(ij)=a_(ij)*, μ_(i)=μ_(i)* and Σ_(i)=Σ_(i)* for i, j=1˜I+1, obtaining a set of re-estimated parameters λ={λ_(i)};

(8) calculating, for the set of parameters λ obtained in step (7), for r=1˜R, t=2˜T^((r)), i=1˜I+1: $L_{2} = \sum_{r=1}^{R} P(Y^{(r)} \mid \lambda) = \sum_{r=1}^{R} \alpha_{I+1}^{(r)}(T^{(r)}+1) \qquad [\text{formula 26}]$

and

(9) if |L₁−L₂|/L₁>ε, then setting L₁=L₂ and going to step (4), or otherwise terminating the procedure.

ε in step (9) is an appropriately small positive number that determines the width of convergence, and a practical value is selected for it according to the particular situation.
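Steps (5) to (9) then amount to accumulating, over the training patterns, the c^(r)-weighted numerators and the shared denominator Den_i, dividing, and testing convergence. A minimal sketch of one re-estimation pass, assuming the quantities (c^(r), γ, ξ, Y^((r))) have been produced per pattern by the forward-backward sketch above; names are illustrative, and the Σ update uses the current μ_i as in formula 20:

```python
import numpy as np

def reestimate(stats, mu):
    """One pass of steps (5)-(6). `stats` holds, per training pattern r,
    a tuple (c, gamma, xi, Y); `mu` is the current (I, d) mean matrix.
    Implements the updates of formulae 18, 19, 20 in the form of formula 25."""
    I, d = mu.shape
    A_num = np.zeros((I, I))
    den = np.zeros(I)                       # shared denominator Den_i
    mu_num = np.zeros((I, d))
    S_num = np.zeros((I, d, d))
    for c, gamma, xi, Y in stats:
        A_num += c * xi.sum(axis=0)         # numerator of formula 18
        den += c * gamma.sum(axis=0)        # sum_t gamma_i(t), weighted by c^(r)
        mu_num += c * gamma.T @ Y           # numerator of formula 19
        for i in range(I):
            diff = Y - mu[i]                # uses the current mu_i (formula 20)
            S_num[i] += c * (gamma[:, i, None] * diff).T @ diff
    return A_num / den[:, None], mu_num / den[:, None], S_num / den[:, None, None]

# Steps (8)-(9): L2 = sum_r P(Y^(r)|lambda) (formula 26); iterate while
# abs(L1 - L2) / L1 > eps, setting L1 = L2 after each pass.
```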

In such a manner, a continuous probability distribution HMM is obtained. Based on this, the present invention provides a discrete probability distribution HMM by a procedure comprising the following steps.

(1) By clustering the learning vectors, M clusters are calculated. The clusters are referred to as C₁, C₂, . . . , C_(M), and the centroid of the cluster C_(m) is referred to as y_(om);

(2) The vector series y^((r))(1), y^((r))(2), . . . , y^((r))(T^((r))) of each training pattern is converted to a centroid series z^((r))(1), z^((r))(2), . . . , z^((r))(T^((r))); and

(3) By treating the centroid series of step (2) as the set of learning patterns, a continuous type HMM is obtained, and the incidence of C_(m) (m=1, . . . , M) in each state of the HMM is determined using the continuous type HMM.

Here, a variety of methods can be considered for defining the incidence of labels: (a) the incidence probability density of the centroid of C_(m) in state i; (b) the mean value or median of the probability densities of the learning vectors classified in C_(m); and (c) the values of (a) or (b) normalized so that their sum over the clusters is 1. In the case of (b), the mean can be an arithmetic mean, a geometric mean or a harmonic mean. Here, as an embodiment of the present invention, the method (b) using an arithmetic mean and without normalization is described. b_(i)(y) used in the formula below is obtained from the estimated parameters of the continuous type HMM. In this case, the incidence b_(im) of the cluster C_(m) in the state i is given by: $b_{im} = \frac{1}{K^{m}} \sum_{k=1}^{K^{m}} b_{i}\left( y_{m}(k) \right) \qquad [\text{formula 27}]$

wherein:

K^(m) is the total number of the feature vectors belonging to thecluster C_(m); and

y_(m)(k) is the k-th feature vector among the feature vectors belonging to the cluster C_(m).
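Under choice (b) with an arithmetic mean and no normalization, formula 27 averages the continuous HMM's density over the K^m training vectors of each cluster. A sketch, assuming single normal distributions b_i(y) with parameters (μ_i, Σ_i) taken from the estimated continuous HMM; the names are illustrative:

```python
import numpy as np

def label_incidence(cluster_vectors, mu, Sigma):
    """b_im = (1/K^m) * sum_k b_i(y_m(k))  (formula 27, choice (b)).

    cluster_vectors: list of (K^m, d) arrays, one per cluster C_m
    mu: (I, d) mean vectors; Sigma: (I, d, d) covariances of the HMM states
    """
    I, d = mu.shape
    b = np.zeros((I, len(cluster_vectors)))
    for i in range(I):
        Sinv = np.linalg.inv(Sigma[i])
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma[i]))
        for m, Ym in enumerate(cluster_vectors):
            diff = Ym - mu[i]
            # quadratic form (y - mu)^T Sigma^{-1} (y - mu) per vector
            q = np.einsum('kd,de,ke->k', diff, Sinv, diff)
            b[i, m] = norm * np.exp(-0.5 * q).mean()  # arithmetic mean over K^m
    return b
```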

For the clustering method noted in step (1) above, a known method can be used, for example the LBG method (not described in detail here). As the data to be clustered, the entire set of feature vectors constituting the patterns corresponding to the vocalized words v=1˜V used in learning of the HMM can be used.

FIGS. 2 and 3 show an embodiment of an HMM generator of the present invention; its structure and operation are described below.

A feature extracting part 101 converts the sound signals of the training words r=1˜R^(v) that are prepared for generating a model corresponding to a word v (=1, . . . , V) to series of feature vectors:

$Y^{v(r)} = (y^{v(r)}(1), y^{v(r)}(2), \ldots, y^{v(r)}(T^{v(r)})) \qquad [\text{formula 28}]$

by a known method.

A word pattern memory part 102 is a memory means such as a RAM, ROM or various discs, and stores the R^(v) training words for generating each model λ^(v) in the form of feature vector series. The total number of stored feature vectors is: $\sum_{v=1}^{V} \sum_{r=1}^{R^{v}} T^{v(r)} \qquad [\text{formula 29}]$

A clustering part 103 clusters the feature vectors, of the number given by formula 29, that are stored in the word pattern memory part 102. Here, the label of the cluster number m is denoted C_(m), and its centroid y_(om).

A cluster vector memory part 104 stores the respective vectors and centroids of the M clusters obtained by the clustering part 103, in a form referenced by m.

A vector quantizing part 105 converts each vector of the feature vector series constituting a training pattern of a word v that is stored in the word pattern memory part 102 to the centroid vector closest to it, by using the centroids in the cluster vector memory part 104. In part 105, an input vector y^(v(r))(t) is converted to a centroid z^(v(r))(t).

A buffer memory 106 temporarily stores the R^(v) word patterns for the word v converted at the vector quantizing part 105.

A parameter estimating part 107 executes steps (1) to (9) for generation of the model λ^(v), considering z^(v(r))(1), z^(v(r))(2), . . . , z^(v(r))(T^(v(r))) as a set of training patterns, and estimates the model λ^(v) corresponding to the word v.

A first parameter memory part 108 temporarily stores the re-estimated values of the parameters obtained in step (6). Re-estimation by the parameter estimating part 107 is performed by using the values in the parameter memory part 108.

A label incidence calculating part 109 calculates the probability densities of the vectors y_(m)(1), . . . , y_(m)(K^(m)) of each cluster C_(m) stored in the cluster vector memory part 104, for v=1, . . . , V, i=1, . . . , I, m=1, . . . , M, from the probability density function of the model λ^(v) stored in the parameter memory part 108, and calculates the incidence b^(v)_(im) of C_(m) in the state i of the HMM of the word v according to formula 27.

A second parameter memory part 110 is a memory means for storing parameters corresponding to the words v=1˜V; the parameters corresponding to the respective words v=1, . . . , V are stored in parameter memory portion 1, . . . , parameter memory portion V, respectively. In other words, the transition probabilities corresponding to each state of the respective words are read from the first parameter memory part 108 and stored in a form that can be referenced by v, i, j. In addition, the label incidences in each state of the respective words are read from the label incidence calculating part 109 and stored in a form that can be referenced by v, i, m.

In such a manner, a discrete probability distribution HMM is generated.

As a result, in the present invention, a continuous probability distribution HMM is first generated, the set of vectors forming the set of patterns used in learning is clustered, the incidence b_(im) of the vectors included in a cluster m in the state i of the HMM is obtained by using the probability density obtained as the continuous probability distribution type HMM, and the model is thereby converted to a discrete probability distribution type HMM.

Next, a method and device for recognizing actual sound inputs by using the models thus obtained is described.

FIG. 5 is a block diagram of the recognizing device; its structure and operation are described together.

A feature extracting part 401 has a structure and function identical to those of the feature extracting part 101 of FIG. 2.

In a codebook 403, the centroids of the clusters stored in the cluster vector memory part of the HMM generator of FIGS. 2 and 3 are stored.

A vector quantizing part 402 calculates the distance between a feature vector y(t) output from the feature extracting part 401 and the representative vectors y_(om) (m=1, . . . , M) of the respective clusters stored in the codebook 403, and converts the feature vector series to a label series by substituting for y(t) the label of the cluster whose representative vector is closest to y(t).

A parameter memory part 404 has a structure and function identical to those of the second parameter memory part 110 of FIG. 3, and the parameters of the model corresponding to each word v (=1, . . . , V) are stored in parameter memory portion v.

A likelihood calculating part 405 calculates the likelihood of the models to the label series obtained at the output of the vector quantizing part 402, by using the content of the parameter memory part 404. In other words, in likelihood calculating portion v, the content of parameter memory portion v is used. For the likelihood calculating method, any of formulae 1, 2 or 3 can be used.
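At recognition time the incidences b^(v)_(im) are table lookups, so the likelihood of formula 1 reduces to the forward recursion over the label series with no density evaluations. A minimal sketch with per-frame rescaling to avoid underflow (formula 2 would replace the sums over states by maxima; names are illustrative):

```python
import numpy as np

def word_loglikelihood(labels, pi, A, B):
    """log P(O | lambda^v) by the forward recursion over a label series.

    labels: o(1)..o(T); pi: (I,); A: (I, I); B: (I, M) with B[i, m] = b^v_im.
    Each step only looks up the column B[:, o(t)]; no density is evaluated.
    """
    alpha = pi * B[:, labels[0]]
    s = alpha.sum()
    logp = np.log(s)
    alpha /= s                     # rescale each frame; accumulate log mass
    for o in labels[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()
        logp += np.log(s)
        alpha /= s
    return logp
```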

A comparing and determining part 406 determines the maximum of the outputs of the likelihood calculating portions 1, . . . , V included in the likelihood calculating part 405, and outputs the word corresponding thereto as the recognition result, thus executing a calculation corresponding to formula 4.

The recognition result is obtained from the comparing and determining part 406.

Although an embodiment for recognition of words has been described above, in the present invention the word may obviously be replaced by a phone, syllable and the like, and the invention is also applicable to patterns other than speech.

Moreover, although the distribution of feature vectors has been described in the embodiment as following a single normal distribution in each state, it is obvious that the present invention provides a more accurate label incidence by using a so-called mixture distribution.

Furthermore, the present invention is applicable not only to speech recognizing devices, but also to other time-series signal processing fields.

In addition, the means of the present invention may be achieved in software by using a computer, or by using dedicated hardware circuits having such functions.

As appreciated from the above description, since the present invention comprises vector quantizing means for quantizing the vectors of a training pattern comprising a vector series and converting the vectors into a label series of the clusters to which they belong, continuous distribution probability density HMM generating means for generating a continuous distribution probability density HMM from the quantized vector series corresponding to each label of the label series, and label incidence calculating means for calculating the incidence of the labels in each state from the training vectors classified in the same clusters and the continuous distribution probability density HMM, an estimation error due to insufficiency and bias of training data, which is a problem of the discrete type HMM, can be eliminated, and a model that retains the advantage of the discrete type HMM, that is, fewer calculations, can be achieved.

What is claimed is:
 1. An HMM generator, comprising: vector quantizing means for generating a model by quantizing vectors of a training pattern having a vector series, and converting said quantizing vectors into a label series of clusters to which they belong, continuous probability distribution density HMM generating means for generating a continuous probability distribution density HMM from a quantized vector series corresponding to each label of said label series of clusters, and label incidence calculating means for calculating the incidence of the labels in each state from said quantizing vectors of a training pattern classified in the same label series of clusters and the continuous probability distribution density HMM.
 2. An HMM memory device, comprising: state transition probability memory means for storing a state transition probability obtained by the HMM (Hidden Markov Model) generator according to claim 1, and label incidence calculating memory means for storing a label incidence in each state.
 3. A likelihood calculating device, comprising: the vector quantizing means according to claim 1 for converting the quantized vector series to a label series of clusters by substituting labels for vectors of a feature vector series that constitute an input pattern for the vector quantizing means, and likelihood calculating means for calculating, from a state transition probability and label incidence stored in an HMM memory device, the likelihood of HMM described by parameters stored in the HMM memory device to the input pattern, said HMM memory device comprising an HMM generator having vector quantizing means for generating a model by quantizing vectors of a training pattern having a vector series, and converting said quantizing vectors into a label series of clusters to which they belong, continuous probability distribution density HMM generating means for generating a continuous probability distribution density HMM from a quantized vector series corresponding to each label of said label series of clusters, and label incidence calculating means for calculating the incidence of the labels in each state from said quantizing vectors of a training pattern classified in the same label series of clusters and the continuous probability distribution density HMM.
 4. A recognizing device comprising the likelihood calculating device according to claim 3 for each recognition unit, wherein: the likelihood of the recognition models to the input signals is calculated, and the recognition unit to which the input signal corresponds is determined from the likelihood.
 5. An HMM (Hidden Markov Model) generator according to claim 1, wherein the label incidence calculating means provides a probability density of quantized vectors corresponding to the clusters from a probability density function of the continuous probability distribution density HMM in state i, and recognizes the probability density as incidence b_(im) of C_(m) in the state i, where C_(m) (m=1, . . . M) is the cluster.
 6. An HMM generator according to claim 5, wherein the label incidence calculating means includes an incidence normalizing means further calculating b_(im)=b_(im)/(b_(il)+ . . . +b_(iM)) from the b_(im) and recognizes the normalized incidence b_(im) as incidence of C_(m) in the state i.
 7. An HMM generator according to claim 5, wherein the label incidence calculating means includes an incidence normalizing means further calculating b_(im)=b_(im)/(b_(il)+ . . . +b_(iM)) from the b_(im) and recognizes the normalized incidence b_(im) as incidence of C_(m) in the state i.
 8. An HMM generator according to claim 1, wherein the label incidence calculating means calculates a probability distribution density of said quantizing vectors of a training pattern including a c_(m) from a probability distribution density function of the continuous probability distribution density HMM in state i, includes characteristic values such as the mean and median of the probability distribution density, and recognizes a characteristic value as an incidence b_(im) of c_(m) in the state i, where c_(m) (m=1, . . . M) is the cluster.
 9. An HMM generator, comprising: word pattern memory means for storing a training pattern and generating a series of feature vectors; vector quantizing means connected to said word pattern memory means for quantizing vectors of a training pattern received from said word pattern memory means and converting said quantizing vectors into a label series of clusters to which they belong; buffer memory means connected to said vector quantizing means for temporarily storing training word patterns of a word converted at said vector quantizing means; parameter estimating means connected to said buffer memory means for generating a model corresponding to said word converted at said vector quantizing means; parameter memory means connected to said parameter estimating means for storing re-estimated values of at least a transition probability for various states; and label incidence calculating means connected to said parameter memory means for calculating the incidence of the labels in each state from said quantizing vectors of a training pattern classified in the same label series of clusters and the continuous probability distribution density HMM.
 10. The HMM generator according to claim 9, wherein said parameter memory means includes a first parameter memory part and a second parameter memory part connected to said first parameter memory part and said label incidence calculating means.
 11. The HMM generator according to claim 10, wherein said second parameter memory part includes a plurality of parameter memory portions for storing parameters corresponding to a respective word.
 12. The HMM generator according to claim 11, further comprising: clustering part means connected to said word pattern memory means for clustering said feature vectors as cluster members and generating a label of cluster members and its centroid.
 13. The HMM generator according to claim 12, further comprising: cluster vector memory means connected to said clustering part means, said vector quantizing means and said label incidence calculating means for storing respective vectors and centroids of clusters generated in said clustering part means.